The inpatientCharges.csv dataset is provided by the Centers for Medicare & Medicaid Services (CMS). It contains discharge counts and costs for each hospital, across states and geographic regions, for various Diagnosis Related Groups (DRGs).
We prepare and analyse the data to detect anomalies, i.e. to identify potential abuse of hospital resources for monetary gain, by benchmarking common practice and filtering records that fall beyond acceptable standards.
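The benchmark-and-filter idea can be sketched on toy numbers before touching the real data. The charge values and the 1.5×IQR fence below are illustrative only, not taken from the dataset:

```python
import pandas as pd

# Toy charges for one DRG; values are invented for illustration.
charges = pd.Series([5777, 5787, 5434, 5417, 5658, 41482])

q1, q3 = charges.quantile(0.25), charges.quantile(0.75)
iqr = q3 - q1
upper = q3 + 1.5 * iqr  # Tukey's rule for the upper fence

outliers = charges[charges > upper]
print(outliers.tolist())  # the 41482 charge stands out
```

The same logic scales to the full dataset once a sensible benchmark (here, the within-DRG distribution) is chosen.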
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Reading the CSV file (26 MB):
df = pd.read_csv('inpatientCharges.csv')
Rename the columns for ease of access and take a glimpse at the data:
df.columns = ['DRG','Provider_Id', 'Provider_Name','Provider_StreetAddress','Provider_City',
'Provider_State','Provider_Zipcode','Hospital_referral_region_desp',
'Total_Discharges','Average_Covered_Charges','Average_Total_Payments',
'Average_Medicare_Payment']
df
| DRG | Provider_Id | Provider_Name | Provider_StreetAddress | Provider_City | Provider_State | Provider_Zipcode | Hospital_referral_region_desp | Total_Discharges | Average_Covered_Charges | Average_Total_Payments | Average_Medicare_Payment | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 10001 | SOUTHEAST ALABAMA MEDICAL CENTER | 1108 ROSS CLARK CIRCLE | DOTHAN | AL | 36301 | AL - Dothan | 91 | 32963.07 | 5777.24 | 4763.73 |
| 1 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 10005 | MARSHALL MEDICAL CENTER SOUTH | 2505 U S HIGHWAY 431 NORTH | BOAZ | AL | 35957 | AL - Birmingham | 14 | 15131.85 | 5787.57 | 4976.71 |
| 2 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 10006 | ELIZA COFFEE MEMORIAL HOSPITAL | 205 MARENGO STREET | FLORENCE | AL | 35631 | AL - Birmingham | 24 | 37560.37 | 5434.95 | 4453.79 |
| 3 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 10011 | ST VINCENT'S EAST | 50 MEDICAL PARK EAST DRIVE | BIRMINGHAM | AL | 35235 | AL - Birmingham | 25 | 13998.28 | 5417.56 | 4129.16 |
| 4 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 10016 | SHELBY BAPTIST MEDICAL CENTER | 1000 FIRST STREET NORTH | ALABASTER | AL | 35007 | AL - Birmingham | 18 | 31633.27 | 5658.33 | 4851.44 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 163060 | 948 - SIGNS & SYMPTOMS W/O MCC | 670041 | SETON MEDICAL CENTER WILLIAMSON | 201 SETON PARKWAY | ROUND ROCK | TX | 78664 | TX - Austin | 23 | 26314.39 | 3806.86 | 3071.39 |
| 163061 | 948 - SIGNS & SYMPTOMS W/O MCC | 670055 | METHODIST STONE OAK HOSPITAL | 1139 E SONTERRA BLVD | SAN ANTONIO | TX | 78258 | TX - San Antonio | 11 | 21704.72 | 4027.36 | 2649.72 |
| 163062 | 948 - SIGNS & SYMPTOMS W/O MCC | 670056 | SETON MEDICAL CENTER HAYS | 6001 KYLE PKWY | KYLE | TX | 78640 | TX - Austin | 19 | 39121.73 | 5704.36 | 4058.36 |
| 163063 | 948 - SIGNS & SYMPTOMS W/O MCC | 670060 | TEXAS REGIONAL MEDICAL CENTER AT SUNNYVALE | 231 SOUTH COLLINS ROAD | SUNNYVALE | TX | 75182 | TX - Dallas | 11 | 28873.09 | 7663.09 | 6848.54 |
| 163064 | 948 - SIGNS & SYMPTOMS W/O MCC | 670068 | TEXAS HEALTH PRESBYTERIAN HOSPITAL FLOWER MOUND | 4400 LONG PRAIRIE ROAD | FLOWER MOUND | TX | 75028 | TX - Dallas | 12 | 15042.00 | 3539.75 | 2887.41 |
163065 rows × 12 columns
Column Names and Data types
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 163065 entries, 0 to 163064
Data columns (total 12 columns):
 #   Column                         Non-Null Count   Dtype
---  ------                         --------------   -----
 0   DRG                            163065 non-null  object
 1   Provider_Id                    163065 non-null  int64
 2   Provider_Name                  163065 non-null  object
 3   Provider_StreetAddress         163065 non-null  object
 4   Provider_City                  163065 non-null  object
 5   Provider_State                 163065 non-null  object
 6   Provider_Zipcode               163065 non-null  int64
 7   Hospital_referral_region_desp  163065 non-null  object
 8   Total_Discharges               163065 non-null  int64
 9   Average_Covered_Charges        163065 non-null  float64
 10  Average_Total_Payments         163065 non-null  float64
 11  Average_Medicare_Payment       163065 non-null  float64
dtypes: float64(3), int64(3), object(6)
memory usage: 14.9+ MB
About the data
df.describe(include = 'all')
| DRG | Provider_Id | Provider_Name | Provider_StreetAddress | Provider_City | Provider_State | Provider_Zipcode | Hospital_referral_region_desp | Total_Discharges | Average_Covered_Charges | Average_Total_Payments | Average_Medicare_Payment | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 163065 | 163065.000000 | 163065 | 163065 | 163065 | 163065 | 163065.000000 | 163065 | 163065.000000 | 163065.000000 | 163065.000000 | 163065.000000 |
| unique | 100 | NaN | 3201 | 3326 | 1977 | 51 | NaN | 306 | NaN | NaN | NaN | NaN |
| top | 194 - SIMPLE PNEUMONIA & PLEURISY W CC | NaN | GOOD SAMARITAN HOSPITAL | 100 MEDICAL CENTER DRIVE | CHICAGO | CA | NaN | CA - Los Angeles | NaN | NaN | NaN | NaN |
| freq | 3023 | NaN | 633 | 183 | 1505 | 13064 | NaN | 3653 | NaN | NaN | NaN | NaN |
| mean | NaN | 255569.865428 | NaN | NaN | NaN | NaN | 47938.121908 | NaN | 42.776304 | 36133.954224 | 9707.473804 | 8494.490964 |
| std | NaN | 151563.671767 | NaN | NaN | NaN | NaN | 27854.323080 | NaN | 51.104042 | 35065.365931 | 7664.642598 | 7309.467261 |
| min | NaN | 10001.000000 | NaN | NaN | NaN | NaN | 1040.000000 | NaN | 11.000000 | 2459.400000 | 2673.000000 | 1148.900000 |
| 25% | NaN | 110092.000000 | NaN | NaN | NaN | NaN | 27261.000000 | NaN | 17.000000 | 15947.160000 | 5234.500000 | 4192.350000 |
| 50% | NaN | 250007.000000 | NaN | NaN | NaN | NaN | 44309.000000 | NaN | 27.000000 | 25245.820000 | 7214.100000 | 6158.460000 |
| 75% | NaN | 380075.000000 | NaN | NaN | NaN | NaN | 72901.000000 | NaN | 49.000000 | 43232.590000 | 11286.400000 | 10056.880000 |
| max | NaN | 670077.000000 | NaN | NaN | NaN | NaN | 99835.000000 | NaN | 3383.000000 | 929118.900000 | 156158.180000 | 154620.810000 |
missing_values = df.isnull().sum()
missing_values_per = df.isnull().mean() * 100  # percentage of missing values per column
pd.concat([missing_values, missing_values_per], axis=1, keys=['Count_of_Missing', 'Percentage'])
| Count_of_Missing | Percentage | |
|---|---|---|
| DRG | 0 | 0.0 |
| Provider_Id | 0 | 0.0 |
| Provider_Name | 0 | 0.0 |
| Provider_StreetAddress | 0 | 0.0 |
| Provider_City | 0 | 0.0 |
| Provider_State | 0 | 0.0 |
| Provider_Zipcode | 0 | 0.0 |
| Hospital_referral_region_desp | 0 | 0.0 |
| Total_Discharges | 0 | 0.0 |
| Average_Covered_Charges | 0 | 0.0 |
| Average_Total_Payments | 0 | 0.0 |
| Average_Medicare_Payment | 0 | 0.0 |
features = ['Total_Discharges','Average_Covered_Charges','Average_Total_Payments','Average_Medicare_Payment','Provider_Zipcode']
for i in features:
    plt.figure(figsize=(15,5))  # one figure per feature
    sns.histplot(df[i])
for i in features:
    plt.figure(figsize=(10,5))  # plot at a reasonable size
    sns.boxplot(x=i, data=df)
Next, we display the record counts for the various geographic variables.
plt.rcParams["axes.labelsize"] = 20  # set before plotting so it takes effect
plt.figure(figsize=(15,5))
sns.countplot(x='Provider_State', data=df, order=df['Provider_State'].value_counts().index)
plt.figure(figsize=(15,5))
sns.countplot(x='Provider_City', data=df, order=df['Provider_City'].value_counts()[:25].index)
plt.xticks(rotation=90);
We examine the correlation between the numeric variables and find that the charge variables are highly correlated.
corr = df.corr(numeric_only=True)  # correlations between the numeric columns
# Set up the matplotlib figure (size must be set before the heatmap is drawn)
sns.set(rc={'figure.figsize': (10, 10)})
ax = sns.heatmap(
    corr,
    vmin=-1, vmax=1, center=0,
    cmap=sns.diverging_palette(20, 230, n=150),
    square=True, linewidths=.5, cbar_kws={"shrink": .5}
)
ax.set_xticklabels(
    ax.get_xticklabels(),
    rotation=45,
    horizontalalignment='right'
);
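A quick complement to eyeballing the heatmap is to extract the strongest off-diagonal pair programmatically. The sketch below uses a small synthetic frame standing in for the numeric columns (the data and column choice are illustrative, not loaded from the CSV):

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(0)
base = rng.normal(10000, 2000, 500)
toy = pd.DataFrame({
    'Average_Total_Payments': base,
    'Average_Medicare_Payment': base * 0.85 + rng.normal(0, 100, 500),
    'Total_Discharges': rng.integers(11, 200, 500),
})

corr = toy.corr()
# Mask the diagonal, flatten to a Series of pairs, keep the strongest
pairs = corr.where(~np.eye(len(corr), dtype=bool)).unstack().dropna()
strongest = pairs.abs().idxmax()
print(strongest)  # the two payment columns, mirroring the heatmap finding
```

On the real data this confirms what the heatmap shows: the payment columns move together far more tightly than either does with discharges.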
var_list = df.columns
sns.pairplot(df[var_list])
Calculating the average amount paid per DRG irrespective of state, then adding a column with the ratio of each hospital's actual amount to that mean. This helps compare costs across states.
benchmark = df.groupby(['DRG'])['Average_Total_Payments'].mean().reset_index()
benchmark.columns = ['DRG','Avg_Payment_ByDRG']
benchmark.head()
| DRG | Avg_Payment_ByDRG | |
|---|---|---|
| 0 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 6960.534004 |
| 1 | 057 - DEGENERATIVE NERVOUS SYSTEM DISORDERS W/... | 6706.276445 |
| 2 | 064 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFA... | 13263.823032 |
| 3 | 065 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFA... | 7922.671141 |
| 4 | 066 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFA... | 5713.985221 |
# add the new feature average spending to the dataset
df1 = pd.merge(df, benchmark, how='left', on=['DRG'])
# Feature 1: ratio of amount spending and average spending by drg
df1['drg_avg_ratio_amount'] = np.where(df1['Avg_Payment_ByDRG']==0,0, df1['Average_Total_Payments'] / df1['Avg_Payment_ByDRG'])
df1['drg_avg_ratio_amount'].describe()
count    163065.000000
mean          1.000000
std           0.248259
min           0.464005
25%           0.839686
50%           0.928391
75%           1.085989
max           9.449417
Name: drg_avg_ratio_amount, dtype: float64
len(df1[df1['drg_avg_ratio_amount']>2]['Provider_Name'].unique())
195
df1[df1['drg_avg_ratio_amount']>8]
| DRG | Provider_Id | Provider_Name | Provider_StreetAddress | Provider_City | Provider_State | Provider_Zipcode | Hospital_referral_region_desp | Total_Discharges | Average_Covered_Charges | Average_Total_Payments | Average_Medicare_Payment | Avg_Payment_ByDRG | drg_avg_ratio_amount | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20388 | 189 - PULMONARY EDEMA & RESPIRATORY FAILURE | 390096 | ST JOSEPH MEDICAL CENTER | 2500 BERNVILLE ROAD | READING | PA | 19605 | PA - Reading | 143 | 24542.94 | 72797.93 | 7718.09 | 9056.326570 | 8.038351 |
| 38923 | 203 - BRONCHITIS & ASTHMA W/O CC/MCC | 220008 | STURDY MEMORIAL HOSPITAL | 211 PARK STREET | ATTLEBORO | MA | 2703 | RI - Providence | 11 | 7965.18 | 41482.09 | 1790.18 | 4389.910134 | 9.449417 |
Calculating the average amount paid per DRG and state, then adding a column with the ratio of each hospital's actual amount to that mean. This helps compare costs within states.
benchmark = df.groupby(['Provider_State','DRG'])['Average_Total_Payments'].mean().reset_index()
benchmark.columns = ['Provider_State','DRG','Avg_Payment_ByStateDRG']
benchmark.head()
| Provider_State | DRG | Avg_Payment_ByStateDRG | |
|---|---|---|---|
| 0 | AK | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 8401.950000 |
| 1 | AK | 057 - DEGENERATIVE NERVOUS SYSTEM DISORDERS W/... | 8799.000000 |
| 2 | AK | 064 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFA... | 21608.010000 |
| 3 | AK | 065 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFA... | 12555.548333 |
| 4 | AK | 066 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFA... | 9275.867500 |
# add the new feature average spending to the dataset
df1 = pd.merge(df1, benchmark, how='left', on=['Provider_State','DRG'])
# Feature 2: ratio of amount spending and average spending by drg in a state
df1['drgState_avg_ratio_amount'] = np.where(df1['Avg_Payment_ByStateDRG']==0,0, df1['Average_Total_Payments'] / df1['Avg_Payment_ByStateDRG'])
df1['drgState_avg_ratio_amount'].describe()
count    163065.000000
mean          1.000000
std           0.195274
min           0.384805
25%           0.880824
50%           0.956830
75%           1.064688
max           7.857934
Name: drgState_avg_ratio_amount, dtype: float64
len(df1[df1['drgState_avg_ratio_amount']>2]['Provider_Name'].unique())
124
df1[df1['drgState_avg_ratio_amount']>7]
| DRG | Provider_Id | Provider_Name | Provider_StreetAddress | Provider_City | Provider_State | Provider_Zipcode | Hospital_referral_region_desp | Total_Discharges | Average_Covered_Charges | Average_Total_Payments | Average_Medicare_Payment | Avg_Payment_ByDRG | drg_avg_ratio_amount | Avg_Payment_ByStateDRG | drgState_avg_ratio_amount | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20388 | 189 - PULMONARY EDEMA & RESPIRATORY FAILURE | 390096 | ST JOSEPH MEDICAL CENTER | 2500 BERNVILLE ROAD | READING | PA | 19605 | PA - Reading | 143 | 24542.94 | 72797.93 | 7718.09 | 9056.326570 | 8.038351 | 9264.257684 | 7.857934 |
| 38923 | 203 - BRONCHITIS & ASTHMA W/O CC/MCC | 220008 | STURDY MEMORIAL HOSPITAL | 211 PARK STREET | ATTLEBORO | MA | 2703 | RI - Providence | 11 | 7965.18 | 41482.09 | 1790.18 | 4389.910134 | 9.449417 | 5711.287429 | 7.263177 |
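The groupby → rename → merge → ratio pattern used above repeats for every geography level. It could be folded into one helper; the sketch below is a possible refactoring, with the `add_ratio_feature` name and the generated column names being my own, not from the notebook:

```python
import pandas as pd
import numpy as np

def add_ratio_feature(data, group_cols, value_col, prefix):
    """Benchmark value_col by group_cols and attach the value/benchmark ratio.

    Column names produced here (Avg_<prefix>, <prefix>_ratio) are illustrative.
    """
    bench_col = f'Avg_{prefix}'
    bench = (data.groupby(group_cols)[value_col].mean()
                 .reset_index().rename(columns={value_col: bench_col}))
    out = data.merge(bench, how='left', on=group_cols)
    out[f'{prefix}_ratio'] = np.where(out[bench_col] == 0, 0,
                                      out[value_col] / out[bench_col])
    return out

# Tiny illustration with invented numbers
toy = pd.DataFrame({'DRG': ['A', 'A', 'B'],
                    'Average_Total_Payments': [100.0, 300.0, 50.0]})
toy = add_ratio_feature(toy, ['DRG'], 'Average_Total_Payments', 'Payment_ByDRG')
print(toy['Payment_ByDRG_ratio'].tolist())  # [0.5, 1.5, 1.0]
```

Each of the remaining feature cells is then a one-line call with a different `group_cols` list.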
Calculating the average amount paid per DRG and hospital referral region (Hospital_referral_region_desp), then adding a column with the ratio of each hospital's actual amount to that mean. This helps compare costs within referral regions.
benchmark = df.groupby(['Hospital_referral_region_desp','DRG'])['Average_Total_Payments'].mean().reset_index()
benchmark.columns = ['Hospital_referral_region_desp','DRG','Avg_Payment_ByRegionDRG']
benchmark.head()
| Hospital_referral_region_desp | DRG | Avg_Payment_ByRegionDRG | |
|---|---|---|---|
| 0 | AK - Anchorage | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 8401.950000 |
| 1 | AK - Anchorage | 057 - DEGENERATIVE NERVOUS SYSTEM DISORDERS W/... | 8799.000000 |
| 2 | AK - Anchorage | 064 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFA... | 21608.010000 |
| 3 | AK - Anchorage | 065 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFA... | 12555.548333 |
| 4 | AK - Anchorage | 066 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFA... | 9275.867500 |
# add the new feature average spending to the dataset
df1 = pd.merge(df1, benchmark, how='left', on=['Hospital_referral_region_desp','DRG'])
# Feature 3: ratio of amount spending and average spending by drg in a Hospital_referral_region_desp
df1['drgRegion_avg_ratio_amount'] = np.where(df1['Avg_Payment_ByRegionDRG']==0,0, df1['Average_Total_Payments'] / df1['Avg_Payment_ByRegionDRG'])
df1['drgRegion_avg_ratio_amount'].describe()
count    163065.000000
mean          1.000000
std           0.159145
min           0.271255
25%           0.902856
50%           0.977701
75%           1.057561
max           4.848303
Name: drgRegion_avg_ratio_amount, dtype: float64
len(df1[df1['drgRegion_avg_ratio_amount']>2]['Provider_Name'].unique())
64
df1[df1['drgRegion_avg_ratio_amount']>4]
| DRG | Provider_Id | Provider_Name | Provider_StreetAddress | Provider_City | Provider_State | Provider_Zipcode | Hospital_referral_region_desp | Total_Discharges | Average_Covered_Charges | Average_Total_Payments | Average_Medicare_Payment | Avg_Payment_ByDRG | drg_avg_ratio_amount | Avg_Payment_ByStateDRG | drgState_avg_ratio_amount | Avg_Payment_ByRegionDRG | drgRegion_avg_ratio_amount | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 37716 | 202 - BRONCHITIS & ASTHMA W CC/MCC | 210009 | JOHNS HOPKINS HOSPITAL, THE | 600 NORTH WOLFE STREET | BALTIMORE | MD | 21287 | MD - Baltimore | 17 | 43980.52 | 41458.47 | 37236.0 | 6011.322103 | 6.896731 | 8374.755758 | 4.950409 | 9473.928889 | 4.376059 |
| 162713 | 948 - SIGNS & SYMPTOMS W/O MCC | 390312 | CANCER TREATMENT CENTERS OF AMERICA | 1331 EAST WYOMING AVENUE | PHILADELPHIA | PA | 19124 | PA - Philadelphia | 24 | 83945.95 | 28968.20 | 8894.2 | 4824.766436 | 6.004063 | 5018.809241 | 5.771927 | 5974.915484 | 4.848303 |
Calculating the average amount paid per DRG and city, then adding a column with the ratio of each hospital's actual amount to that mean. This helps compare costs within cities. We group by state and city together because different states may contain cities with the same name.
benchmark = df.groupby(['Provider_State','Provider_City','DRG'])['Average_Total_Payments'].mean().reset_index()
benchmark.columns = ['Provider_State','Provider_City','DRG','Avg_Payment_ByCityDRG']
benchmark.head()
| Provider_State | Provider_City | DRG | Avg_Payment_ByCityDRG | |
|---|---|---|---|---|
| 0 | AK | ANCHORAGE | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 8401.950000 |
| 1 | AK | ANCHORAGE | 057 - DEGENERATIVE NERVOUS SYSTEM DISORDERS W/... | 8799.000000 |
| 2 | AK | ANCHORAGE | 064 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFA... | 17773.020000 |
| 3 | AK | ANCHORAGE | 065 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFA... | 11348.766667 |
| 4 | AK | ANCHORAGE | 066 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFA... | 8729.390000 |
# add the new feature average spending to the dataset
df1 = pd.merge(df1, benchmark, how='left', on=['Provider_State','Provider_City','DRG'])
# Feature 4: ratio of amount spending and average spending by drg in a city
df1['drgCity_avg_ratio_amount'] = np.where(df1['Avg_Payment_ByCityDRG']==0,0, df1['Average_Total_Payments'] / df1['Avg_Payment_ByCityDRG'])
df1['drgCity_avg_ratio_amount'].describe()
count    163065.000000
mean          1.000000
std           0.092549
min           0.238195
25%           1.000000
50%           1.000000
75%           1.000000
max           3.753099
Name: drgCity_avg_ratio_amount, dtype: float64
len(df1[df1['drgCity_avg_ratio_amount']>2]['Provider_Name'].unique())
14
df1[df1['drgCity_avg_ratio_amount']>3]
| DRG | Provider_Id | Provider_Name | Provider_StreetAddress | Provider_City | Provider_State | Provider_Zipcode | Hospital_referral_region_desp | Total_Discharges | Average_Covered_Charges | Average_Total_Payments | Average_Medicare_Payment | Avg_Payment_ByDRG | drg_avg_ratio_amount | Avg_Payment_ByStateDRG | drgState_avg_ratio_amount | Avg_Payment_ByRegionDRG | drgRegion_avg_ratio_amount | Avg_Payment_ByCityDRG | drgCity_avg_ratio_amount | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 37716 | 202 - BRONCHITIS & ASTHMA W CC/MCC | 210009 | JOHNS HOPKINS HOSPITAL, THE | 600 NORTH WOLFE STREET | BALTIMORE | MD | 21287 | MD - Baltimore | 17 | 43980.52 | 41458.47 | 37236.0 | 6011.322103 | 6.896731 | 8374.755758 | 4.950409 | 9473.928889 | 4.376059 | 11046.464 | 3.753099 |
| 162713 | 948 - SIGNS & SYMPTOMS W/O MCC | 390312 | CANCER TREATMENT CENTERS OF AMERICA | 1331 EAST WYOMING AVENUE | PHILADELPHIA | PA | 19124 | PA - Philadelphia | 24 | 83945.95 | 28968.20 | 8894.2 | 4824.766436 | 6.004063 | 5018.809241 | 5.771927 | 5974.915484 | 4.848303 | 8350.448 | 3.469059 |
Calculating the average amount paid per DRG and zipcode, then adding a column with the ratio of each hospital's actual amount to that mean. This helps compare costs within zipcodes.
benchmark = df.groupby(['Provider_Zipcode','DRG'])['Average_Total_Payments'].mean().reset_index()
benchmark.columns = ['Provider_Zipcode','DRG','Avg_Payment_ByZipDRG']
benchmark.head()
| Provider_Zipcode | DRG | Avg_Payment_ByZipDRG | |
|---|---|---|---|
| 0 | 1040 | 057 - DEGENERATIVE NERVOUS SYSTEM DISORDERS W/... | 7130.69 |
| 1 | 1040 | 064 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFA... | 11785.76 |
| 2 | 1040 | 065 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFA... | 7609.65 |
| 3 | 1040 | 069 - TRANSIENT ISCHEMIA | 4859.96 |
| 4 | 1040 | 101 - SEIZURES W/O MCC | 5059.14 |
# add the new feature average spending to the dataset
df1 = pd.merge(df1, benchmark, how='left', on=['Provider_Zipcode','DRG'])
# Feature 5: ratio of amount spent to average spending by DRG in a zipcode
df1['drgZip_avg_ratio_amount'] = np.where(df1['Avg_Payment_ByZipDRG']==0,0, df1['Average_Total_Payments'] / df1['Avg_Payment_ByZipDRG'])
df1['drgZip_avg_ratio_amount'].describe()
count    163065.000000
mean          1.000000
std           0.035360
min           0.303604
25%           1.000000
50%           1.000000
75%           1.000000
max           1.897813
Name: drgZip_avg_ratio_amount, dtype: float64
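Note that the 25th, 50th, and 75th percentiles of the city- and zip-level ratios all sit at exactly 1.000: when a zip (or city) contains a single provider for a DRG, that provider's payment equals its own group mean, so the ratio is forced to 1. A toy check of this effect (data invented for illustration):

```python
import pandas as pd

toy = pd.DataFrame({'zip': [1, 1, 2, 3],
                    'pay': [100.0, 300.0, 50.0, 80.0]})

ratio = toy['pay'] / toy.groupby('zip')['pay'].transform('mean')
# Singleton groups (zips 2 and 3) yield a ratio of exactly 1.0
print(ratio.tolist())  # [0.5, 1.5, 1.0, 1.0]

group_sizes = toy.groupby('zip')['pay'].transform('size')
print((group_sizes == 1).mean())  # fraction of rows in singleton groups
```

So the fine-grained ratios are only informative for providers that actually share a zip or city with a competitor; singletons can never be flagged at this level.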
df1[df1['drgZip_avg_ratio_amount']>1.4999]
| DRG | Provider_Id | Provider_Name | Provider_StreetAddress | Provider_City | Provider_State | Provider_Zipcode | Hospital_referral_region_desp | Total_Discharges | Average_Covered_Charges | ... | Avg_Payment_ByDRG | drg_avg_ratio_amount | Avg_Payment_ByStateDRG | drgState_avg_ratio_amount | Avg_Payment_ByRegionDRG | drgRegion_avg_ratio_amount | Avg_Payment_ByCityDRG | drgCity_avg_ratio_amount | Avg_Payment_ByZipDRG | drgZip_avg_ratio_amount | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 16057 | 177 - RESPIRATORY INFECTIONS & INFLAMMATIONS W... | 330013 | ALBANY MEDICAL CENTER HOSPITAL | 43 NEW SCOTLAND AVENUE | ALBANY | NY | 12208 | NY - Albany | 11 | 101070.36 | ... | 13829.385496 | 2.580684 | 17246.198333 | 2.069399 | 14782.019231 | 2.414370 | 23131.575000 | 1.542881 | 23131.575000 | 1.542881 |
| 29126 | 193 - SIMPLE PNEUMONIA & PLEURISY W MCC | 40016 | UAMS MEDICAL CENTER | 4301 WEST MARKHAM STREET MAIL SLOT 612 | LITTLE ROCK | AR | 72205 | AR - Little Rock | 68 | 49928.55 | ... | 10053.683177 | 1.700526 | 8698.876061 | 1.965375 | 8812.345789 | 1.940068 | 10363.245000 | 1.649729 | 11345.470000 | 1.506905 |
| 80031 | 312 - SYNCOPE & COLLAPSE | 40016 | UAMS MEDICAL CENTER | 4301 WEST MARKHAM STREET MAIL SLOT 612 | LITTLE ROCK | AR | 72205 | AR - Little Rock | 31 | 45348.58 | ... | 5003.199240 | 2.876384 | 4592.729200 | 3.133457 | 4803.203750 | 2.996150 | 6644.077500 | 2.166007 | 7603.646667 | 1.892660 |
| 92330 | 377 - G.I. HEMORRHAGE W MCC | 330350 | UNIVERSITY HOSPITAL OF BROOKLYN ( DOWNSTATE ) | 445 LENOX ROAD | BROOKLYN | NY | 11203 | NY - Manhattan | 25 | 114716.72 | ... | 12490.130542 | 4.898011 | 16231.499900 | 3.769017 | 22181.768333 | 2.757977 | 24799.446923 | 2.466861 | 34163.193333 | 1.790723 |
| 109462 | 460 - SPINAL FUSION EXCEPT CERVICAL W/O MCC | 150166 | PINNACLE HOSPITAL | 9301 CONNECTICUT DR | CROWN POINT | IN | 46307 | IN - Gary | 14 | 225729.14 | ... | 27778.671119 | 4.722593 | 29810.952564 | 4.400643 | 43758.768333 | 2.997967 | 77332.980000 | 1.696396 | 77332.980000 | 1.696396 |
| 113739 | 470 - MAJOR JOINT REPLACEMENT OR REATTACHMENT ... | 450880 | BAYLOR SURGICAL HOSPITAL AT FORT WORTH | 750 12TH AVENUE | FORT WORTH | TX | 76104 | TX - Fort Worth | 113 | 39467.76 | ... | 14566.929215 | 2.655761 | 13950.560045 | 2.773099 | 15010.779545 | 2.577233 | 17619.263750 | 2.195681 | 20384.666000 | 1.897813 |
| 128071 | 638 - DIABETES W CC | 140030 | SHERMAN HOSPITAL | 1425 NORTH RANDALL ROAD | ELGIN | IL | 60123 | IL - Elgin | 23 | 28618.86 | ... | 5889.570467 | 2.442674 | 6097.494778 | 2.359379 | 7350.992500 | 1.957055 | 9542.180000 | 1.507653 | 9542.180000 | 1.507653 |
| 130909 | 640 - MISC DISORDERS OF NUTRITION,METABOLISM,F... | 450068 | MEMORIAL HERMANN TEXAS MEDICAL CENTER | 6411 FANNIN | HOUSTON | TX | 77030 | TX - Houston | 56 | 56178.03 | ... | 8129.881570 | 2.309416 | 7640.931343 | 2.457198 | 8319.361667 | 2.256817 | 9190.058462 | 2.042999 | 12036.830000 | 1.559819 |
8 rows × 22 columns
Calculating the average discharges per DRG irrespective of state, then adding a column with the ratio of each hospital's actual discharges to that mean. This helps compare discharge volumes across states.
benchmark = df.groupby(['DRG'])['Total_Discharges'].mean().reset_index()
benchmark.columns = ['DRG','Avg_Discharges_ByDRG']
benchmark.head()
| DRG | Avg_Discharges_ByDRG | |
|---|---|---|
| 0 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 31.145505 |
| 1 | 057 - DEGENERATIVE NERVOUS SYSTEM DISORDERS W/... | 25.155704 |
| 2 | 064 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFA... | 37.427969 |
| 3 | 065 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFA... | 46.899074 |
| 4 | 066 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFA... | 30.924142 |
# add the new feature average discharges to the dataset
df1 = pd.merge(df1, benchmark, how='left', on=['DRG'])
# Feature 6: ratio of discharges to average discharges by DRG, irrespective of state
df1['drg_avg_ratio_discharges'] = np.where(df1['Avg_Discharges_ByDRG']==0,0, df1['Total_Discharges'] / df1['Avg_Discharges_ByDRG'])
df1['drg_avg_ratio_discharges'].describe()
count    163065.000000
mean          1.000000
std           0.760072
min           0.070809
25%           0.523420
50%           0.788843
75%           1.238076
max          44.569469
Name: drg_avg_ratio_discharges, dtype: float64
len(df1[df1['drg_avg_ratio_discharges']>2]['Provider_Name'].unique())
1259
df1[df1['drg_avg_ratio_discharges']>44]
| DRG | Provider_Id | Provider_Name | Provider_StreetAddress | Provider_City | Provider_State | Provider_Zipcode | Hospital_referral_region_desp | Total_Discharges | Average_Covered_Charges | ... | Avg_Payment_ByStateDRG | drgState_avg_ratio_amount | Avg_Payment_ByRegionDRG | drgRegion_avg_ratio_amount | Avg_Payment_ByCityDRG | drgCity_avg_ratio_amount | Avg_Payment_ByZipDRG | drgZip_avg_ratio_amount | Avg_Discharges_ByDRG | drg_avg_ratio_discharges | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 158518 | 897 - ALCOHOL/DRUG ABUSE OR DEPENDENCE W/O REH... | 220062 | ADCARE HOSPITAL OF WORCESTER INC | 107 LINCOLN STREET | WORCESTER | MA | 1605 | MA - Worcester | 1571 | 4227.46 | ... | 5161.984545 | 0.962335 | 5298.505 | 0.93754 | 5931.033333 | 0.837554 | 4967.56 | 1.0 | 35.248344 | 44.569469 |
1 rows × 24 columns
Calculating the average discharges per DRG and state, then adding a column with the ratio of each hospital's actual discharges to that mean. This helps compare discharge volumes within states.
benchmark = df.groupby(['Provider_State','DRG'])['Total_Discharges'].mean().reset_index()
benchmark.columns = ['Provider_State','DRG','Avg_Discharges_ByStateDRG']
benchmark.head()
| Provider_State | DRG | Avg_Discharges_ByStateDRG | |
|---|---|---|---|
| 0 | AK | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 23.000000 |
| 1 | AK | 057 - DEGENERATIVE NERVOUS SYSTEM DISORDERS W/... | 11.000000 |
| 2 | AK | 064 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFA... | 22.500000 |
| 3 | AK | 065 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFA... | 25.333333 |
| 4 | AK | 066 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFA... | 21.000000 |
# add the new feature average discharges to the dataset
df1 = pd.merge(df1, benchmark, how='left', on=['Provider_State','DRG'])
# Feature 7: ratio of total discharges and average discharges by drg in a state
df1['drgState_avg_ratio_discharges'] = np.where(df1['Avg_Discharges_ByStateDRG']==0,0, df1['Total_Discharges'] / df1['Avg_Discharges_ByStateDRG'])
df1['drgState_avg_ratio_discharges'].describe()
count    163065.000000
mean          1.000000
std           0.686087
min           0.045025
25%           0.550452
50%           0.819820
75%           1.251471
max          24.624593
Name: drgState_avg_ratio_discharges, dtype: float64
len(df1[df1['drgState_avg_ratio_discharges']>2]['Provider_Name'].unique())
1227
df1[df1['drgState_avg_ratio_discharges']>24]
| DRG | Provider_Id | Provider_Name | Provider_StreetAddress | Provider_City | Provider_State | Provider_Zipcode | Hospital_referral_region_desp | Total_Discharges | Average_Covered_Charges | ... | Avg_Payment_ByRegionDRG | drgRegion_avg_ratio_amount | Avg_Payment_ByCityDRG | drgCity_avg_ratio_amount | Avg_Payment_ByZipDRG | drgZip_avg_ratio_amount | Avg_Discharges_ByDRG | drg_avg_ratio_discharges | Avg_Discharges_ByStateDRG | drgState_avg_ratio_discharges | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 112969 | 470 - MAJOR JOINT REPLACEMENT OR REATTACHMENT ... | 330270 | HOSPITAL FOR SPECIAL SURGERY | 535 EAST 70TH STREET | NEW YORK | NY | 10021 | NY - Manhattan | 3383 | 53113.58 | ... | 22709.351 | 0.837703 | 23000.354444 | 0.827105 | 20525.163333 | 0.926848 | 155.348 | 21.776914 | 137.382979 | 24.624593 |
1 rows × 26 columns
Calculating the average discharges per DRG and hospital referral region (Hospital_referral_region_desp), then adding a column with the ratio of each hospital's actual discharges to that mean. This helps compare discharge volumes within referral regions.
benchmark = df.groupby(['Hospital_referral_region_desp','DRG'])['Total_Discharges'].mean().reset_index()
benchmark.columns = ['Hospital_referral_region_desp','DRG','Avg_Discharges_ByRegionDRG']
benchmark.head()
| Hospital_referral_region_desp | DRG | Avg_Discharges_ByRegionDRG | |
|---|---|---|---|
| 0 | AK - Anchorage | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 23.000000 |
| 1 | AK - Anchorage | 057 - DEGENERATIVE NERVOUS SYSTEM DISORDERS W/... | 11.000000 |
| 2 | AK - Anchorage | 064 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFA... | 22.500000 |
| 3 | AK - Anchorage | 065 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFA... | 25.333333 |
| 4 | AK - Anchorage | 066 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFA... | 21.000000 |
# add the new feature average discharges to the dataset
df1 = pd.merge(df1, benchmark, how='left', on=['Hospital_referral_region_desp','DRG'])
# Feature 8: ratio of total discharges and average discharges by DRG in a Hospital_referral_region_desp
df1['drgRegion_avg_ratio_discharges'] = np.where(df1['Avg_Discharges_ByRegionDRG']==0,0, df1['Total_Discharges'] / df1['Avg_Discharges_ByRegionDRG'])
df1['drgRegion_avg_ratio_discharges'].describe()
count    163065.000000
mean          1.000000
std           0.596288
min           0.015412
25%           0.592105
50%           0.886894
75%           1.250000
max          11.008786
Name: drgRegion_avg_ratio_discharges, dtype: float64
len(df1[df1['drgState_avg_ratio_discharges']>11]['Provider_Name'].unique())
4
df1[df1['drgState_avg_ratio_discharges']>11]
| DRG | Provider_Id | Provider_Name | Provider_StreetAddress | Provider_City | Provider_State | Provider_Zipcode | Hospital_referral_region_desp | Total_Discharges | Average_Covered_Charges | ... | Avg_Payment_ByCityDRG | drgCity_avg_ratio_amount | Avg_Payment_ByZipDRG | drgZip_avg_ratio_amount | Avg_Discharges_ByDRG | drg_avg_ratio_discharges | Avg_Discharges_ByStateDRG | drgState_avg_ratio_discharges | Avg_Discharges_ByRegionDRG | drgRegion_avg_ratio_discharges | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 80295 | 312 - SYNCOPE & COLLAPSE | 100007 | FLORIDA HOSPITAL | 601 E ROLLINS ST | ORLANDO | FL | 32803 | FL - Orlando | 984 | 24118.51 | ... | 4801.620000 | 0.963185 | 4624.850000 | 1.000000 | 62.027098 | 15.864034 | 85.506494 | 11.507898 | 116.321429 | 8.459318 |
| 112969 | 470 - MAJOR JOINT REPLACEMENT OR REATTACHMENT ... | 330270 | HOSPITAL FOR SPECIAL SURGERY | 535 EAST 70TH STREET | NEW YORK | NY | 10021 | NY - Manhattan | 3383 | 53113.58 | ... | 23000.354444 | 0.827105 | 20525.163333 | 0.926848 | 155.348000 | 21.776914 | 137.382979 | 24.624593 | 307.300000 | 11.008786 |
| 158235 | 897 - ALCOHOL/DRUG ABUSE OR DEPENDENCE W/O REH... | 50752 | BROTMAN MEDICAL CENTER | 3828 DELMAS TERRACE | CULVER CITY | CA | 90231 | CA - Los Angeles | 510 | 19917.41 | ... | 4986.280000 | 1.000000 | 4986.280000 | 1.000000 | 35.248344 | 14.468765 | 33.161290 | 15.379377 | 55.125000 | 9.251701 |
| 158518 | 897 - ALCOHOL/DRUG ABUSE OR DEPENDENCE W/O REH... | 220062 | ADCARE HOSPITAL OF WORCESTER INC | 107 LINCOLN STREET | WORCESTER | MA | 1605 | MA - Worcester | 1571 | 4227.46 | ... | 5931.033333 | 0.837554 | 4967.560000 | 1.000000 | 35.248344 | 44.569469 | 68.045455 | 23.087508 | 281.666667 | 5.577515 |
4 rows × 28 columns
Calculating the average discharges per DRG and city, then adding a column with the ratio of each hospital's actual discharges to that mean. This helps compare discharge volumes within cities. We group by state and city together because different states may contain cities with the same name.
benchmark = df.groupby(['Provider_State','Provider_City','DRG'])['Total_Discharges'].mean().reset_index()
benchmark.columns = ['Provider_State','Provider_City','DRG','Avg_Discharges_ByCityDRG']
benchmark.head()
| Provider_State | Provider_City | DRG | Avg_Discharges_ByCityDRG | |
|---|---|---|---|---|
| 0 | AK | ANCHORAGE | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 23.000000 |
| 1 | AK | ANCHORAGE | 057 - DEGENERATIVE NERVOUS SYSTEM DISORDERS W/... | 11.000000 |
| 2 | AK | ANCHORAGE | 064 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFA... | 34.000000 |
| 3 | AK | ANCHORAGE | 065 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFA... | 33.666667 |
| 4 | AK | ANCHORAGE | 066 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFA... | 27.500000 |
# add the new feature, average discharges by city and DRG, to the dataset
df1 = pd.merge(df1, benchmark, how='left', on=['Provider_State','Provider_City','DRG'])
# Feature 9: ratio of actual discharges to the average discharges for the DRG in a city
df1['drgCity_avg_ratio_discharges'] = np.where(df1['Avg_Discharges_ByCityDRG']==0,0, df1['Total_Discharges'] / df1['Avg_Discharges_ByCityDRG'])
df1['drgCity_avg_ratio_discharges'].describe()
count    163065.000000
mean          1.000000
std           0.295053
min           0.019540
25%           1.000000
50%           1.000000
75%           1.000000
max           5.622857
Name: drgCity_avg_ratio_discharges, dtype: float64
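As an aside, the pattern used throughout this section (groupby → mean → rename → merge → ratio) can be written more compactly with `groupby().transform('mean')`, which aligns each group's mean back onto the original rows without an explicit merge. A minimal sketch on invented toy data (column names taken from the notebook, values made up):

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real dataset.
toy = pd.DataFrame({
    'Provider_State':   ['AL', 'AL', 'AL', 'NY'],
    'Provider_City':    ['DOTHAN', 'DOTHAN', 'BOAZ', 'DOTHAN'],
    'DRG':              ['039', '039', '039', '039'],
    'Total_Discharges': [90, 110, 40, 50],
})

# transform('mean') broadcasts each group's mean back onto the original rows
grp = toy.groupby(['Provider_State', 'Provider_City', 'DRG'])['Total_Discharges']
toy['Avg_Discharges_ByCityDRG'] = grp.transform('mean')
toy['drgCity_avg_ratio_discharges'] = np.where(
    toy['Avg_Discharges_ByCityDRG'] == 0, 0,
    toy['Total_Discharges'] / toy['Avg_Discharges_ByCityDRG'])

print(toy[['Total_Discharges', 'Avg_Discharges_ByCityDRG',
           'drgCity_avg_ratio_discharges']])
```

The merge-based version in the notebook produces the same columns; `transform` simply avoids creating and merging the intermediate `benchmark` frame.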
len(df1[df1['drgCity_avg_ratio_discharges']>2]['Provider_Name'].unique())
285
df1[df1['drgCity_avg_ratio_discharges']>5]
| DRG | Provider_Id | Provider_Name | Provider_StreetAddress | Provider_City | Provider_State | Provider_Zipcode | Hospital_referral_region_desp | Total_Discharges | Average_Covered_Charges | ... | Avg_Payment_ByZipDRG | drgZip_avg_ratio_amount | Avg_Discharges_ByDRG | drg_avg_ratio_discharges | Avg_Discharges_ByStateDRG | drgState_avg_ratio_discharges | Avg_Discharges_ByRegionDRG | drgRegion_avg_ratio_discharges | Avg_Discharges_ByCityDRG | drgCity_avg_ratio_discharges | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 102629 | 392 - ESOPHAGITIS, GASTROENT & MISC DIGEST DIS... | 140281 | NORTHWESTERN MEMORIAL HOSPITAL | 251 E HURON ST | CHICAGO | IL | 60611 | IL - Chicago | 448 | 20482.50 | ... | 6321.750000 | 1.000000 | 83.001356 | 5.397502 | 106.760000 | 4.196328 | 87.370370 | 5.127596 | 87.370370 | 5.127596 |
| 111610 | 470 - MAJOR JOINT REPLACEMENT OR REATTACHMENT ... | 50625 | CEDARS-SINAI MEDICAL CENTER | 8700 BEVERLY BLVD | LOS ANGELES | CA | 90048 | CA - Los Angeles | 738 | 110123.21 | ... | 18690.780000 | 1.000000 | 155.348000 | 4.750624 | 126.514894 | 5.833305 | 108.338983 | 6.811952 | 131.250000 | 5.622857 |
| 112969 | 470 - MAJOR JOINT REPLACEMENT OR REATTACHMENT ... | 330270 | HOSPITAL FOR SPECIAL SURGERY | 535 EAST 70TH STREET | NEW YORK | NY | 10021 | NY - Manhattan | 3383 | 53113.58 | ... | 20525.163333 | 0.926848 | 155.348000 | 21.776914 | 137.382979 | 24.624593 | 307.300000 | 11.008786 | 614.111111 | 5.508775 |
| 125144 | 603 - CELLULITIS W/O MCC | 50625 | CEDARS-SINAI MEDICAL CENTER | 8700 BEVERLY BLVD | LOS ANGELES | CA | 90048 | CA - Los Angeles | 272 | 55841.68 | ... | 8035.030000 | 1.000000 | 50.193801 | 5.418996 | 41.864407 | 6.497166 | 45.582090 | 5.967256 | 51.812500 | 5.249698 |
| 154650 | 871 - SEPTICEMIA OR SEVERE SEPSIS W/O MV 96+ H... | 450184 | MEMORIAL HERMANN HOSPITAL SYSTEM | 1635 NORTH LOOP WEST | HOUSTON | TX | 77008 | TX - Houston | 754 | 35003.01 | ... | 12726.680000 | 1.000000 | 113.467994 | 6.645046 | 107.148148 | 7.036986 | 98.021277 | 7.692208 | 145.266667 | 5.190454 |
5 rows × 30 columns
Calculating the average discharges per DRG and Zipcode, then creating a second column with the ratio of actual discharges to that mean. This helps compare discharges across hospitals within a Zipcode.
benchmark = df.groupby(['Provider_Zipcode','DRG'])['Total_Discharges'].mean().reset_index()
benchmark.columns = ['Provider_Zipcode','DRG','Avg_Discharges_ByZipDRG']
benchmark.head()
| Provider_Zipcode | DRG | Avg_Discharges_ByZipDRG | |
|---|---|---|---|
| 0 | 1040 | 057 - DEGENERATIVE NERVOUS SYSTEM DISORDERS W/... | 13.0 |
| 1 | 1040 | 064 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFA... | 17.0 |
| 2 | 1040 | 065 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFA... | 26.0 |
| 3 | 1040 | 069 - TRANSIENT ISCHEMIA | 26.0 |
| 4 | 1040 | 101 - SEIZURES W/O MCC | 21.0 |
# add the new feature, average discharges by Zipcode and DRG, to the dataset
df1 = pd.merge(df1, benchmark, how='left', on=['Provider_Zipcode','DRG'])
# Feature 10: ratio of actual discharges to the average discharges for the DRG in a Zipcode
df1['drgZip_avg_ratio_discharges'] = np.where(df1['Avg_Discharges_ByZipDRG']==0,0, df1['Total_Discharges'] / df1['Avg_Discharges_ByZipDRG'])
df1['drgZip_avg_ratio_discharges'].describe()
count    163065.000000
mean          1.000000
std           0.122841
min           0.028081
25%           1.000000
50%           1.000000
75%           1.000000
max           3.019324
Name: drgZip_avg_ratio_discharges, dtype: float64
len(df1[df1['drgZip_avg_ratio_discharges']>2]['Provider_Name'].unique())
12
df1[df1['drgZip_avg_ratio_discharges']>3]
| DRG | Provider_Id | Provider_Name | Provider_StreetAddress | Provider_City | Provider_State | Provider_Zipcode | Hospital_referral_region_desp | Total_Discharges | Average_Covered_Charges | ... | Avg_Discharges_ByDRG | drg_avg_ratio_discharges | Avg_Discharges_ByStateDRG | drgState_avg_ratio_discharges | Avg_Discharges_ByRegionDRG | drgRegion_avg_ratio_discharges | Avg_Discharges_ByCityDRG | drgCity_avg_ratio_discharges | Avg_Discharges_ByZipDRG | drgZip_avg_ratio_discharges | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 113682 | 470 - MAJOR JOINT REPLACEMENT OR REATTACHMENT ... | 450672 | PLAZA MEDICAL CENTER OF FORT WORTH | 900 EIGHTH AVENUE | FORT WORTH | TX | 76104 | TX - Fort Worth | 750 | 121140.24 | ... | 155.348 | 4.82787 | 134.352679 | 5.582323 | 108.136364 | 6.935687 | 175.75 | 4.267425 | 248.4 | 3.019324 |
1 rows × 32 columns
features2 = ['drg_avg_ratio_amount',
'drgState_avg_ratio_amount',
'drgRegion_avg_ratio_amount',
'drgCity_avg_ratio_amount',
'drgZip_avg_ratio_amount',
'drg_avg_ratio_discharges',
'drgState_avg_ratio_discharges',
'drgRegion_avg_ratio_discharges',
'drgCity_avg_ratio_discharges',
'drgZip_avg_ratio_discharges']
df1[(df1['drg_avg_ratio_amount']>2) & (df1['drg_avg_ratio_discharges']>2)]['Provider_Name'].unique()
array(['UNIVERSITY OF MARYLAND MEDICAL CENTER',
'JOHNS HOPKINS HOSPITAL, THE',
'JOHNS HOPKINS BAYVIEW MEDICAL CENTER',
'MAIMONIDES MEDICAL CENTER', 'UMASS MEMORIAL MEDICAL CENTER INC',
'NYU HOSPITALS CENTER', 'CEDARS-SINAI MEDICAL CENTER',
'STANFORD HOSPITAL', 'ST JOSEPH MEDICAL CENTER',
'AVERA HEART HOSPITAL OF SOUTH DAKOTA LLC',
'MEDICAL COLLEGE OF VIRGINIA HOSPITALS',
'NEW YORK-PRESBYTERIAN HOSPITAL', 'BRONX-LEBANON HOSPITAL CENTER',
'UCSF MEDICAL CENTER', 'JACKSON MEMORIAL HOSPITAL',
'MEMORIAL HERMANN TEXAS MEDICAL CENTER',
'UNIVERSITY OF ILLINOIS HOSPITAL'], dtype=object)
# one histogram per ratio feature (a single plt.figure call avoids creating empty figures)
for i in features2:
    plt.figure(figsize=(15,5))
    sns.histplot(df1[i])
    plt.title(i)
    plt.show()
Over 75% of hospitals have similar charges for a specific DRG, but one outlier charges 9 times the national average. 195 hospitals charge more than double the others, and two stand out as extremes: ST JOSEPH MEDICAL CENTER, PA charges 8 times the average and STURDY MEMORIAL HOSPITAL, MA 9 times.
Over 75% of hospitals have similar charges within a given state for a particular ailment. Over 124 charge more than double their state's average, and again ST JOSEPH MEDICAL CENTER, PA and STURDY MEMORIAL HOSPITAL, MA stand out as extremes.
Charges within a given hospital referral region are mostly consistent. 64 hospitals charge double, and two extreme outliers, THE JOHNS HOPKINS HOSPITAL, MD and CANCER TREATMENT CENTERS OF AMERICA, PA, charge roughly quadruple the average.
Most hospitals charge similar amounts within a given city for a particular DRG. Over 14 of them charge more than double their city's average, and again THE JOHNS HOPKINS HOSPITAL, MD and CANCER TREATMENT CENTERS OF AMERICA, PA stand out as extremes, charging triple the average.
Hospitals within a given Zipcode charge nearly the same price; just 8 hospitals charge 50% more than the others.
Over 75% of hospitals have similar discharge counts for a specific DRG, but one outlier records 44 times the average. 1,259 hospitals have double the discharges of others in their cities, and ADCARE HOSPITAL OF WORCESTER INC stands out with 44.5 times the average discharges for alcohol/drug dependence treatments.
Most hospitals (over 75%) have similar discharge rates within a given state for a particular ailment. Over 1,227 have more than twice their state's average, and HOSPITAL FOR SPECIAL SURGERY, NY stands out for major joint replacement and reattachment.
Discharge rates within a given hospital referral region are mostly consistent. 1,227 hospitals have double the rate, and 4 stand out as extreme outliers with roughly 11 times the average; HOSPITAL FOR SPECIAL SURGERY, NY and ADCARE HOSPITAL OF WORCESTER INC, MA deserve special mention among the 4.
Most hospitals have similar discharge counts within a given city for a particular DRG. Over 285 of them exceed double their city's average, 5 hospitals exceed 5 times it, and HOSPITAL FOR SPECIAL SURGERY, NY appears again as an outlier.
Hospitals within a given Zipcode have similar discharge rates; 12 have twice the discharges of their Zipcode peers, and PLAZA MEDICAL CENTER OF FORT WORTH, TX has three times the rate for major joint replacement and reattachment.
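The counts quoted above all follow one pattern: filter a ratio feature against a threshold and count the distinct providers. A hedged sketch of that pattern on invented toy data (the helper `outlier_providers` is my own, not from the notebook):

```python
import pandas as pd

# Invented toy frame mimicking one of the notebook's ratio features.
toy = pd.DataFrame({
    'Provider_Name': ['A', 'A', 'B', 'C', 'D'],
    'drgCity_avg_ratio_discharges': [2.5, 3.1, 1.0, 2.2, 0.8],
})

def outlier_providers(df, feature, threshold):
    """Distinct providers whose ratio exceeds the threshold."""
    return df.loc[df[feature] > threshold, 'Provider_Name'].unique()

flagged = outlier_providers(toy, 'drgCity_avg_ratio_discharges', 2)
print(sorted(flagged))  # → ['A', 'C']
```

Raising the threshold (2, then 5, and so on) narrows the list down to the extreme outliers named in the observations.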
# selecting necessary features
df2 = df1[features2]
from sklearn.model_selection import train_test_split
from pyod.models.pca import PCA
from pyod.utils.data import evaluate_print
from pyod.utils.example import visualize
X_train, X_test = train_test_split(df2,test_size = 0.3, random_state=0)
clf = PCA()
clf.fit(X_train)
PCA(contamination=0.1, copy=True, iterated_power='auto', n_components=None, n_selected_components=None, random_state=None, standardization=True, svd_solver='auto', tol=0.0, weighted=True, whiten=False)
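pyod's PCA detector scores each point by its weighted distance to the principal subspace learned from the data. As a simplified illustration of the idea (plain reconstruction error, not pyod's exact eigenvalue weighting), consider a toy dataset where one point falls off the dominant direction:

```python
import numpy as np

# Five points on the line y = x, plus one off-line outlier at (4, 0).
X = np.array([[0., 0.], [1., 1.], [2., 2.], [3., 3.], [4., 4.], [4., 0.]])

# Center the data and extract the top principal direction via SVD.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
v = Vt[0]  # first principal component

# Reconstruct each point from the 1-D principal subspace;
# the anomaly score is the squared reconstruction error.
proj = Xc @ np.outer(v, v)
scores = ((Xc - proj) ** 2).sum(axis=1)
print(scores.argmax())  # index 5: the off-line point scores highest
```

Points that follow the dominant structure reconstruct almost perfectly; the one that does not gets a far larger score, which is the intuition behind the large gap between the median training score (~954) and the maximum (~29,279) seen below.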
y_train_pred = clf.labels_  # binary labels: 0 = inlier, 1 = outlier
# The raw anomaly scores of the training data:
y_train_scores = clf.decision_scores_
# Create a DataFrame from the training scores and describe it.
y_df = pd.DataFrame(y_train_scores)
y_df.columns = ['score']
y_df.describe()
| score | |
|---|---|
| count | 114145.000000 |
| mean | 1222.549132 |
| std | 814.137294 |
| min | 383.928116 |
| 25% | 748.116115 |
| 50% | 953.797147 |
| 75% | 1378.077308 |
| max | 29278.883575 |
y_df[y_df['score']>25000]
| score | |
|---|---|
| 103728 | 29278.883575 |
X_train.iloc[[103728]]
| drg_avg_ratio_amount | drgState_avg_ratio_amount | drgRegion_avg_ratio_amount | drgCity_avg_ratio_amount | drgZip_avg_ratio_amount | drg_avg_ratio_discharges | drgState_avg_ratio_discharges | drgRegion_avg_ratio_discharges | drgCity_avg_ratio_discharges | drgZip_avg_ratio_discharges | |
|---|---|---|---|---|---|---|---|---|---|---|
| 158518 | 1.003566 | 0.962335 | 0.93754 | 0.837554 | 1.0 | 44.569469 | 23.087508 | 5.577515 | 2.863305 | 1.0 |
df1.iloc[[158518]]
| DRG | Provider_Id | Provider_Name | Provider_StreetAddress | Provider_City | Provider_State | Provider_Zipcode | Hospital_referral_region_desp | Total_Discharges | Average_Covered_Charges | ... | Avg_Discharges_ByDRG | drg_avg_ratio_discharges | Avg_Discharges_ByStateDRG | drgState_avg_ratio_discharges | Avg_Discharges_ByRegionDRG | drgRegion_avg_ratio_discharges | Avg_Discharges_ByCityDRG | drgCity_avg_ratio_discharges | Avg_Discharges_ByZipDRG | drgZip_avg_ratio_discharges | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 158518 | 897 - ALCOHOL/DRUG ABUSE OR DEPENDENCE W/O REH... | 220062 | ADCARE HOSPITAL OF WORCESTER INC | 107 LINCOLN STREET | WORCESTER | MA | 1605 | MA - Worcester | 1571 | 4227.46 | ... | 35.248344 | 44.569469 | 68.045455 | 23.087508 | 281.666667 | 5.577515 | 548.666667 | 2.863305 | 1571.0 | 1.0 |
1 rows × 32 columns
# Creating histogram plot for the entire dataset
plt.figure(figsize=(15,10))
plt.hist(y_train_scores, bins='auto', color='b', edgecolor='red',linewidth=1) # arguments are passed to np.histogram
plt.title("Histogram with 'auto' bins")
plt.show()
def pca_scatter_plot(i, x, y, y_train_scores, X_train):
    plt.figure(i, figsize=(8, 5))
    sns.scatterplot(x=x, y=y, hue=y_train_scores, data=X_train, palette='RdBu_r')
    plt.title('Anomaly Scores by PCA')
    plt.show()
    plt.close(i)  # close each figure so 100 figures are not held in memory
# Creating a plot between all pairs of variables, resulting in 100 images
i = 0
for f in features2:
    for e in features2:
        print(f, e)
        pca_scatter_plot(i, f, e, y_train_scores, X_train)
        i += 1
(Output truncated: the loop printed all 100 feature-pair names; matplotlib also warned that more than 20 figures were open at once.)
df2.describe()
| drg_avg_ratio_amount | drgState_avg_ratio_amount | drgRegion_avg_ratio_amount | drgCity_avg_ratio_amount | drgZip_avg_ratio_amount | drg_avg_ratio_discharges | drgState_avg_ratio_discharges | drgRegion_avg_ratio_discharges | drgCity_avg_ratio_discharges | drgZip_avg_ratio_discharges | |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 163065.000000 | 163065.000000 | 163065.000000 | 163065.000000 | 163065.000000 | 163065.000000 | 163065.000000 | 163065.000000 | 163065.000000 | 163065.000000 |
| mean | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| std | 0.248259 | 0.195274 | 0.159145 | 0.092549 | 0.035360 | 0.760072 | 0.686087 | 0.596288 | 0.295053 | 0.122841 |
| min | 0.464005 | 0.384805 | 0.271255 | 0.238195 | 0.303604 | 0.070809 | 0.045025 | 0.015412 | 0.019540 | 0.028081 |
| 25% | 0.839686 | 0.880824 | 0.902856 | 1.000000 | 1.000000 | 0.523420 | 0.550452 | 0.592105 | 1.000000 | 1.000000 |
| 50% | 0.928391 | 0.956830 | 0.977701 | 1.000000 | 1.000000 | 0.788843 | 0.819820 | 0.886894 | 1.000000 | 1.000000 |
| 75% | 1.085989 | 1.064688 | 1.057561 | 1.000000 | 1.000000 | 1.238076 | 1.251471 | 1.250000 | 1.000000 | 1.000000 |
| max | 9.449417 | 7.857934 | 4.848303 | 3.753099 | 1.897813 | 44.569469 | 24.624593 | 11.008786 | 5.622857 | 3.019324 |
# Now we have the trained PCA model, let's apply to the test data to get the predictions
y_test_pred = clf.predict(X_test)
# And you can generate the anomaly score using clf.decision_function:
y_test_scores = clf.decision_function(X_test)
# Creating histogram plot for the entire dataset
plt.hist(y_test_scores, bins='auto', color='b', edgecolor='red', linewidth=1) # arguments are passed to np.histogram
plt.title("Histogram with 'auto' bins")
plt.show()
# Implementing a cut-off of 5000 to plot the data.
plt.hist(y_test_scores[y_test_scores<5000], bins='auto', color='b') # arguments are passed to np.histogram
plt.title("Histogram with 'auto' bins")
plt.show()
df_test = pd.DataFrame(X_test)
df_test['score'] = y_test_scores
df_test['cluster'] = np.where(df_test['score']<5000, 0, 1)
df_test['cluster'].value_counts()
# Now let's show the summary statistics:
df_test.groupby('cluster').mean()
| drg_avg_ratio_amount | drgState_avg_ratio_amount | drgRegion_avg_ratio_amount | drgCity_avg_ratio_amount | drgZip_avg_ratio_amount | drg_avg_ratio_discharges | drgState_avg_ratio_discharges | drgRegion_avg_ratio_discharges | drgCity_avg_ratio_discharges | drgZip_avg_ratio_discharges | score | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| cluster | |||||||||||
| 0 | 0.995788 | 0.996082 | 0.997402 | 0.998881 | 0.999699 | 0.986215 | 0.986857 | 0.989791 | 0.995465 | 0.999308 | 1186.152363 |
| 1 | 1.721466 | 1.624544 | 1.432264 | 1.218105 | 1.056132 | 3.428737 | 3.265358 | 2.856968 | 1.890639 | 1.128286 | 6483.757887 |
df_test['cluster'].value_counts()
0    48682
1      238
Name: cluster, dtype: int64
HBOS (Histogram-Based Outlier Score) is an unsupervised, histogram-based algorithm for capturing outliers. It is fast and a suitable option for detecting global outliers.
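The core idea can be sketched in a few lines: build a histogram per feature, and score each point by summing the negative log of the density of the bin it falls into. This is a simplification of pyod's implementation (which also normalizes and regularizes the bins), written on synthetic data with a planted outlier:

```python
import numpy as np

def hbos_scores(X, n_bins=10):
    """Simplified HBOS: sum of -log(bin density) over features."""
    X = np.asarray(X, dtype=float)
    scores = np.zeros(X.shape[0])
    for j in range(X.shape[1]):
        hist, edges = np.histogram(X[:, j], bins=n_bins, density=True)
        dens = np.clip(hist, 1e-12, None)  # avoid log(0) for empty bins
        # map each value to its bin, clipping to a valid bin index
        idx = np.clip(np.digitize(X[:, j], edges) - 1, 0, n_bins - 1)
        scores += -np.log(dens[idx])
    return scores

rng = np.random.default_rng(0)
X = rng.normal(0, 1, size=(200, 2))
X[0] = [8.0, 8.0]               # plant an obvious outlier
print(hbos_scores(X).argmax())  # the planted point scores highest
```

Points sitting in dense bins get low scores; a point alone in a sparse bin in every feature gets the highest score, which is why HBOS works well for global outliers but can miss local ones.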
#import the hbos model
from pyod.models.hbos import HBOS
# creating 50 bins
n_bins = 50
# initiating HBOS method and passing 50 as bins
hbos = HBOS(n_bins=n_bins)
# passing training dataset to HBOS model
hbos.fit(X_train)
HBOS(alpha=0.1, contamination=0.1, n_bins=n_bins, tol=0.5)
# Generate the anomaly scores for the training data using hbos.decision_function:
y_train_scores = hbos.decision_function(X_train)
#selecting the necessary features
X_test = X_test[features2]
# Now we have the trained hbos model, let's apply to the test data to get the predictions
y_test_pred = hbos.predict(X_test)
# Because it is '0' and '1', we can run a count statistic.
unique, counts = np.unique(y_test_pred, return_counts=True)
dict(zip(unique, counts))
# Generate the anomaly scores for the test data (note: hbos, not the earlier PCA clf):
y_test_scores = hbos.decision_function(X_test)
# Creating histogram plot for the entire dataset
plt.hist(y_test_scores, bins='auto', color='b', edgecolor='red') # arguments are passed to np.histogram
plt.title("Histogram with 'auto' bins")
plt.show()
# Implementing a cut-off of 5000 to plot the data.
plt.hist(y_test_scores[y_test_scores<5000], bins='auto', color='b') # arguments are passed to np.histogram
plt.title("Histogram with 'auto' bins")
plt.show()
df_test = pd.DataFrame(X_test)
df_test['score'] = y_test_scores
df_test['cluster'] = np.where(df_test['score']<5000, 0, 1)
df_test['cluster'].value_counts()
# Now let's show the summary statistics:
df_test.groupby('cluster').mean()
| drg_avg_ratio_amount | drgState_avg_ratio_amount | drgRegion_avg_ratio_amount | drgCity_avg_ratio_amount | drgZip_avg_ratio_amount | drg_avg_ratio_discharges | drgState_avg_ratio_discharges | drgRegion_avg_ratio_discharges | drgCity_avg_ratio_discharges | drgZip_avg_ratio_discharges | score | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| cluster | |||||||||||
| 0 | 0.995788 | 0.996082 | 0.997402 | 0.998881 | 0.999699 | 0.986215 | 0.986857 | 0.989791 | 0.995465 | 0.999308 | 1186.152363 |
| 1 | 1.721466 | 1.624544 | 1.432264 | 1.218105 | 1.056132 | 3.428737 | 3.265358 | 2.856968 | 1.890639 | 1.128286 | 6483.757887 |
df_test['cluster'].value_counts()
0    48682
1      238
Name: cluster, dtype: int64
from pyod.models.combination import aom, moa, average, maximization
from pyod.utils.utility import standardizer
# Standardize data
X_train_norm, X_test_norm = standardizer(X_train, X_test[features2])
# Test a range of bins
n_clf = 10
k_list = [5, 10, 15, 20, 25, 30, 35, 40, 45, 50]
# Prepare arrays to store the model results
train_scores = np.zeros([X_train.shape[0], n_clf])
test_scores = np.zeros([X_test.shape[0], n_clf])
train_scores.shape
# Modeling
for i in range(n_clf):
k = k_list[i]
hbos = HBOS(n_bins=k)
hbos.fit(X_train_norm)
# Store the results in each column:
train_scores[:, i] = hbos.decision_scores_
test_scores[:, i] = hbos.decision_function(X_test_norm)
# Decision scores have to be normalized before combination
train_scores_norm, test_scores_norm = standardizer(train_scores,test_scores)
Combination by average: test_scores_norm is 48920 × 10. The "average" function takes the average of the 10 columns, so the result y_by_average is a single column:
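Ignoring pyod's optional detector weights, the combination helpers reduce the (n_samples × n_detectors) score matrix across detectors; their cores correspond to simple numpy reductions, as this sketch on an invented matrix shows:

```python
import numpy as np

# Invented 4 x 3 score matrix: rows are samples, columns are detectors.
scores = np.array([[0.1, 0.2, 0.3],
                   [1.0, 0.8, 1.2],
                   [0.2, 0.1, 0.4],
                   [2.0, 1.5, 2.5]])

by_average = scores.mean(axis=1)  # core of combination.average
by_max = scores.max(axis=1)       # core of combination.maximization
print(by_average)
print(by_max)
```

The aom and moa variants used later partition the detector columns into buckets first, then take the max within each bucket and average across buckets (aom), or average within each bucket and take the max across buckets (moa).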
y_by_average = average(test_scores_norm)
import matplotlib.pyplot as plt
plt.hist(y_by_average, bins='auto') # arguments are passed to np.histogram
plt.title("Combination by average")
plt.show()
# A cut-off of 3 is chosen because the histogram tail flattens out beyond 3
df_test['y_by_average_score'] = y_by_average
df_test['y_by_average_cluster'] = np.where(df_test['y_by_average_score']<3, 0, 1)
df_test['y_by_average_cluster'].value_counts()
0    48629
1      291
Name: y_by_average_cluster, dtype: int64
In our case it identifies 291 data points with outlier scores higher than 3. To get the summary statistics for each cluster, we run the following code, which produces the average values below.
df_test.groupby("y_by_average_cluster").mean()
| drg_avg_ratio_amount | drgState_avg_ratio_amount | drgRegion_avg_ratio_amount | drgCity_avg_ratio_amount | drgZip_avg_ratio_amount | drg_avg_ratio_discharges | drgState_avg_ratio_discharges | drgRegion_avg_ratio_discharges | drgCity_avg_ratio_discharges | drgZip_avg_ratio_discharges | score | cluster | y_by_average_score | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| y_by_average_cluster | |||||||||||||
| 0 | 0.996884 | 0.997036 | 0.998111 | 0.999150 | 0.999779 | 0.985743 | 0.985632 | 0.988785 | 0.994309 | 0.997562 | 1189.871839 | 0.002756 | 0.069902 |
| 1 | 1.406215 | 1.350667 | 1.234707 | 1.133136 | 1.032396 | 3.062811 | 3.054960 | 2.684892 | 1.920782 | 1.396673 | 4897.340434 | 0.357388 | 3.340246 |
# Combination by maximization
y_by_maximization = maximization(test_scores_norm)
plt.hist(y_by_maximization, bins='auto') # arguments are passed to np.histogram
plt.title("Combination by max")
plt.show()
# A cut-off of 4 is chosen because the histogram is continuous up to 4, after which a gap appears
df_test['y_by_maximization_score'] = y_by_maximization
df_test['y_by_maximization_cluster'] = np.where(df_test['y_by_maximization_score']<4, 0, 1)
df_test['y_by_maximization_cluster'].value_counts()
0    48520
1      400
Name: y_by_maximization_cluster, dtype: int64
When we use the Maximization method, we get 400 data points with outlier scores higher than 4. We use the following code to produce the summary statistics by cluster.
df_test.groupby('y_by_maximization_cluster').mean()
| drg_avg_ratio_amount | drgState_avg_ratio_amount | drgRegion_avg_ratio_amount | drgCity_avg_ratio_amount | drgZip_avg_ratio_amount | drg_avg_ratio_discharges | drgState_avg_ratio_discharges | drgRegion_avg_ratio_discharges | drgCity_avg_ratio_discharges | drgZip_avg_ratio_discharges | score | cluster | y_by_average_score | y_by_average_cluster | y_by_maximization_score | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| y_by_maximization_cluster | |||||||||||||||
| 0 | 0.997041 | 0.997050 | 0.997947 | 0.999232 | 1.00011 | 0.981484 | 0.981408 | 0.984606 | 0.992456 | 0.996846 | 1183.370471 | 0.002329 | 0.064493 | 0.001298 | 1.222311 |
| 1 | 1.275638 | 1.252693 | 1.190115 | 1.086762 | 0.98340 | 3.013407 | 3.003438 | 2.729677 | 1.893075 | 1.374722 | 4675.671119 | 0.312500 | 3.105172 | 0.570000 | 4.675187 |
# Combination by aom
y_by_aom = aom(test_scores_norm, n_buckets=5)
plt.hist(y_by_aom, bins='auto') # arguments are passed to np.histogram
plt.title("Average of Maximum")
plt.show()
# A cut-off of 3.5 is chosen because the histogram tail flattens out beyond 3.5
df_test['y_by_aom_score'] = y_by_aom
df_test['y_by_aom_cluster'] = np.where(df_test['y_by_aom_score']<3.5, 0, 1)
df_test['y_by_aom_cluster'].value_counts()
0    48662
1      258
Name: y_by_aom_cluster, dtype: int64
When we use the Average-of-Maximum method, we get 258 data points with outlier scores higher than 3.5. We use the following code to produce the summary statistics by cluster.
df_test.groupby("y_by_aom_cluster").mean()
| drg_avg_ratio_amount | drgState_avg_ratio_amount | drgRegion_avg_ratio_amount | drgCity_avg_ratio_amount | drgZip_avg_ratio_amount | drg_avg_ratio_discharges | drgState_avg_ratio_discharges | drgRegion_avg_ratio_discharges | drgCity_avg_ratio_discharges | drgZip_avg_ratio_discharges | score | cluster | y_by_average_score | y_by_average_cluster | y_by_maximization_score | y_by_maximization_cluster | y_by_aom_score | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| y_by_aom_cluster | |||||||||||||||||
| 0 | 0.997214 | 0.997270 | 0.998234 | 0.999227 | 0.999824 | 0.986406 | 0.986358 | 0.989484 | 0.994752 | 0.997805 | 1191.780899 | 0.002877 | 0.071954 | 0.000863 | 1.232075 | 0.003576 | 0.320773 |
| 1 | 1.396282 | 1.351755 | 1.241668 | 1.135863 | 1.028184 | 3.203400 | 3.182828 | 2.770024 | 1.955833 | 1.401815 | 5011.479077 | 0.379845 | 3.371538 | 0.965116 | 4.734140 | 0.875969 | 3.905668 |
# Combination by moa
y_by_moa = moa(test_scores_norm, n_buckets=5)
plt.hist(y_by_moa, bins='auto') # arguments are passed to np.histogram
plt.title("Maximum of Average")
plt.show()
# A cut-off of 3.5 is chosen because the histogram tail flattens out beyond 3.5
df_test['y_by_moa_score'] = y_by_moa
df_test['y_by_moa_cluster'] = np.where(df_test['y_by_moa_score']<3.5, 0, 1)
df_test['y_by_moa_cluster'].value_counts()
0    48510
1      410
Name: y_by_moa_cluster, dtype: int64
When we use the Maximum-of-Average method, we get 410 data points with outlier scores higher than 3.5. We use the following code to produce the summary statistics by cluster.
df_test.groupby("y_by_moa_cluster").mean()
| drg_avg_ratio_amount | drgState_avg_ratio_amount | drgRegion_avg_ratio_amount | drgCity_avg_ratio_amount | drgZip_avg_ratio_amount | drg_avg_ratio_discharges | drgState_avg_ratio_discharges | drgRegion_avg_ratio_discharges | drgCity_avg_ratio_discharges | drgZip_avg_ratio_discharges | score | cluster | y_by_average_score | y_by_average_cluster | y_by_maximization_score | y_by_maximization_cluster | y_by_aom_score | y_by_aom_cluster | y_by_moa_score | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| y_by_moa_cluster | |||||||||||||||||||
| 0 | 0.996609 | 0.996679 | 0.997750 | 0.999018 | 0.999863 | 0.982898 | 0.982789 | 0.985936 | 0.993383 | 0.997095 | 1183.244619 | 0.002412 | 0.063224 | 0.000330 | 1.222468 | 0.001299 | 0.311379 | 0.000000 | 0.589354 |
| 1 | 1.319991 | 1.290348 | 1.208677 | 1.109860 | 1.013013 | 2.796484 | 2.790750 | 2.529678 | 1.761514 | 1.336060 | 4605.383562 | 0.295122 | 3.181135 | 0.670732 | 4.572472 | 0.821951 | 3.688066 | 0.629268 | 3.896672 |
We have implemented two unsupervised models, PCA and HBOS.
Leveraging PCA, 100 charts were created and we observe that the largest outlier score exceeds 25,000; it corresponds to ADCARE HOSPITAL OF WORCESTER INC.
From the HBOS analysis we find that over 7% of the data can be classified as outliers.